library(tidyverse)
Interactive plot using the fifa18 dataset: here we try to see whether age group has any effect on the potential, shot, and passing capabilities of a player.
The fifa18 dataset contains 17,076 rows and 40 columns describing player attributes such as stamina, potential, kicking and passing ability, aggression, and balance.
fifa <- read_csv("C:/Susmitha Chereddy/Data_visualization/Mini_project_2_chereddy/data/fifa18.csv")
## Rows: 17076 Columns: 40
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): name, nationality, club
## dbl (37): age, overall, potential, acceleration, aggression, agility, balanc...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
fifa
## # A tibble: 17,076 × 40
## name nationality club age overall potential acceleration aggression
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cristiano … Portugal Real… 32 94 94 89 63
## 2 L. Messi Argentina FC B… 30 93 93 92 48
## 3 Neymar Brazil Pari… 25 92 94 94 56
## 4 L. Suárez Uruguay FC B… 30 92 92 88 78
## 5 M. Neuer Germany FC B… 31 92 92 58 29
## 6 R. Lewando… Poland FC B… 28 91 91 79 80
## 7 De Gea Spain Manc… 26 90 92 57 38
## 8 E. Hazard Belgium Chel… 26 90 91 93 54
## 9 T. Kroos Germany Real… 27 90 90 60 60
## 10 G. Higuaín Argentina Juve… 29 90 90 78 50
## # … with 17,066 more rows, and 32 more variables: agility <dbl>, balance <dbl>,
## # ball_control <dbl>, composure <dbl>, crossing <dbl>, curve <dbl>,
## # dribbling <dbl>, finishing <dbl>, free_kick_accuracy <dbl>,
## # gk_diving <dbl>, gk_handling <dbl>, gk_kicking <dbl>, gk_positioning <dbl>,
## # gk_reflexes <dbl>, heading_accuracy <dbl>, interceptions <dbl>,
## # jumping <dbl>, long_passing <dbl>, long_shots <dbl>, marking <dbl>,
## # penalties <dbl>, positioning <dbl>, reactions <dbl>, short_passing <dbl>, …
summary(fifa$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 16.00 21.00 25.00 25.11 28.00 47.00
fifa_group <- fifa %>% mutate(
  # Create categories; case_when() uses the first condition that matches
  age_group = dplyr::case_when(
    age <= 21 ~ "very_young",
    age <= 25 ~ "young",
    age <= 28 ~ "prime",
    age > 28 ~ "experienced"
  ),
  # Convert to a factor with the levels in age order
  age_group = factor(age_group, levels = c("very_young", "young", "prime", "experienced"))
)
fifa_group
## # A tibble: 17,076 × 41
## name nationality club age overall potential acceleration aggression
## <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Cristiano … Portugal Real… 32 94 94 89 63
## 2 L. Messi Argentina FC B… 30 93 93 92 48
## 3 Neymar Brazil Pari… 25 92 94 94 56
## 4 L. Suárez Uruguay FC B… 30 92 92 88 78
## 5 M. Neuer Germany FC B… 31 92 92 58 29
## 6 R. Lewando… Poland FC B… 28 91 91 79 80
## 7 De Gea Spain Manc… 26 90 92 57 38
## 8 E. Hazard Belgium Chel… 26 90 91 93 54
## 9 T. Kroos Germany Real… 27 90 90 60 60
## 10 G. Higuaín Argentina Juve… 29 90 90 78 50
## # … with 17,066 more rows, and 33 more variables: agility <dbl>, balance <dbl>,
## # ball_control <dbl>, composure <dbl>, crossing <dbl>, curve <dbl>,
## # dribbling <dbl>, finishing <dbl>, free_kick_accuracy <dbl>,
## # gk_diving <dbl>, gk_handling <dbl>, gk_kicking <dbl>, gk_positioning <dbl>,
## # gk_reflexes <dbl>, heading_accuracy <dbl>, interceptions <dbl>,
## # jumping <dbl>, long_passing <dbl>, long_shots <dbl>, marking <dbl>,
## # penalties <dbl>, positioning <dbl>, reactions <dbl>, short_passing <dbl>, …
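The same four age bands can also be produced with base R's `cut()`, which returns a factor with its levels already in order. The sketch below uses a small made-up vector of ages (not the fifa18 data) and assumes the same break points as the `case_when()` call above: 21, 25, and 28.

```r
# Hypothetical ages for illustration only; not taken from fifa18.
ages <- c(16, 21, 22, 25, 26, 28, 29, 47)

# Intervals are right-closed by default:
# (-Inf, 21], (21, 25], (25, 28], (28, Inf]
age_group <- cut(
  ages,
  breaks = c(-Inf, 21, 25, 28, Inf),
  labels = c("very_young", "young", "prime", "experienced")
)

# Count how many ages fall in each band
table(age_group)
```

Because the intervals are closed on the right, an age of exactly 21 lands in "very_young" and 28 lands in "prime", matching the `<=` comparisons used above.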
Visualizing relationship between shot_power and long_passing
library(viridis)
## Loading required package: viridisLite
my_fifa_plot_1 <- ggplot(data = fifa_group) +
geom_point(aes(x = shot_power, y = long_passing,
color=age_group), alpha = 0.5)+
scale_color_viridis(discrete = TRUE)+
scale_x_log10() +
labs(title = "Relationship between shot_power and long_passing",
subtitle = "FIFA 18 dataset",
caption = "Data source: reisanar/datasets")+
theme(plot.title = element_text(hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1, face = "italic"))
fifa_plot <- my_fifa_plot_1 +
  # A single x/y pair draws the label once instead of twice on top of itself
  annotate("text", x = 50, y = 50,
           label = "Long-pass ~ shot power", color = "white",
           size = 4, angle = 45, fontface = "bold")
fifa_plot
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
ggplotly(fifa_plot)
Visualizing relationship between potential and standing_tackle
library(viridis)
my_fifa_plot_2 <- ggplot(data = fifa_group) +
geom_point(aes(x = potential, y = standing_tackle,
color=age_group), alpha = 0.5)+
scale_color_viridis(discrete = TRUE)+
scale_x_log10() +
labs(title = "Relationship between potential and standing_tackle",
subtitle = "FIFA 18 dataset",
caption = "Data source: reisanar/datasets")+
theme(plot.title = element_text(hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1, face = "italic"))
my_fifa_plot_2
library(plotly)
ggplotly(my_fifa_plot_2)
Spatial visualization using the Florida lakes dataset: we map the lakes in the state of Florida, and in Polk County in particular, to see whether the name Lakeland has any relation to the number of lakes there. The Florida Lakes dataset contains 4,243 rows and 7 columns: PERIMETER, NAME, COUNTY, OBJECTID, SHAPEAREA, SHAPELEN, and geometry.
library(sf)
## Linking to GEOS 3.9.1, GDAL 3.3.2, PROJ 7.2.1; sf_use_s2() is TRUE
florida_shapes <- read_sf("C:/Susmitha Chereddy/Data_visualization/Mini_project_2_chereddy/data/Florida_Lakes/Florida_Lakes/Florida_Lakes.shp")
florida_shapes
## Simple feature collection with 4243 features and 6 fields
## Geometry type: MULTIPOLYGON
## Dimension: XY
## Bounding box: xmin: -87.42774 ymin: 25.02625 xmax: -80.03097 ymax: 31.00254
## Geodetic CRS: WGS 84
## # A tibble: 4,243 × 7
## PERIMETER NAME COUNTY OBJECTID SHAPEAREA SHAPELEN geometry
## <dbl> <chr> <chr> <int> <dbl> <dbl> <MULTIPOLYGON [°]>
## 1 11082. Lake … ORANGE 1 1818000. 11082. (((-81.34813 28.62354, -…
## 2 2834. Black… ESCAM… 2 31380. 2834. (((-87.42029 30.49087, -…
## 3 18768. Lake … HIGHL… 3 13601177. 18768. (((-81.4614 27.46472, -8…
## 4 493. Halfm… ESCAM… 4 6337. 493. (((-87.3131 30.74034, -8…
## 5 5663. Cresc… ESCAM… 5 338242. 5663. (((-87.27591 30.4692, -8…
## 6 317. Black… SANTA… 6 2380. 317. (((-87.26869 30.69546, -…
## 7 181. Beave… ESCAM… 7 1381. 181. (((-87.27064 30.70558, -…
## 8 1376. Salte… ESCAM… 8 24421. 1376. (((-87.26273 30.94937, -…
## 9 1914. Forty… SANTA… 9 178663. 1914. (((-87.18693 30.81357, -…
## 10 328. Hutso… SANTA… 10 7838. 328. (((-87.14079 30.96851, -…
## # … with 4,233 more rows
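Because the question here is about the number of lakes per county, a plain count can complement the maps. The sketch below uses a tiny made-up table with the same `NAME` and `COUNTY` columns (the lake names and rows are hypothetical); with the real data one would first drop the geometry column, for example with `sf::st_drop_geometry(florida_shapes)`.

```r
# Hypothetical rows for illustration; the real table has 4,243 features.
lakes <- data.frame(
  NAME   = c("Lake A", "Lake B", "Lake C", "Lake D", "Lake E", "Lake F"),
  COUNTY = c("POLK", "POLK", "ORANGE", "POLK", "ORANGE", "ESCAMBIA")
)

# Number of lakes per county, largest first
lake_counts <- sort(table(lakes$COUNTY), decreasing = TRUE)
lake_counts

# County with the most lakes in this toy table
names(lake_counts)[1]
```

Note that a raw count treats every feature as one lake regardless of size, so it answers a different question than the area-filled maps.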
Visualizing all lakes in Florida
ggplot() +
  geom_sf(data = florida_shapes, aes(fill = SHAPEAREA),
          color = "black", size = 0.15) +
  # One fill scale carries both the colours and the comma-formatted labels;
  # adding a second fill scale would silently replace the first
  scale_fill_gradient(low = "Darkblue", high = "blue",
                      labels = scales::comma,
                      guide = "colorbar", na.value = "DarkGrey") +
  theme(legend.position = "right") +
  labs(title = "Map of All Lakes in state of Florida",
       subtitle = "Lakes shown in blue",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))
Visualizing lakes in Orange county
florida_shapes %>%
  filter(COUNTY == "ORANGE") %>%
  ggplot() +
  geom_sf(aes(fill = SHAPEAREA),
          color = "black", size = 0.15) +
  # Single fill scale: colours and comma-formatted labels together
  scale_fill_gradient(low = "navyblue", high = "blue",
                      labels = scales::comma,
                      guide = "colorbar", na.value = "DarkGrey") +
  theme(legend.position = "right") +
  labs(title = "Map of All Lakes in Orange County",
       subtitle = "Orange County: State of Florida",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))
Visualizing lakes in Polk County
florida_shapes %>%
  filter(COUNTY == "POLK") %>%
  ggplot() +
  geom_sf(aes(fill = SHAPEAREA),
          color = "black", size = 0.15) +
  # Single fill scale: colours and comma-formatted labels together
  scale_fill_gradient(low = "navyblue", high = "blue",
                      labels = scales::comma,
                      guide = "colorbar", na.value = "DarkGrey") +
  theme(legend.position = "right") +
  # Title and subtitle now match the POLK filter above
  labs(title = "Map of All Lakes in Polk County",
       subtitle = "Polk County: State of Florida",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))
Visualizing lakes of Florida with perimeter as fill
ggplot() +
  geom_sf(data = florida_shapes, aes(fill = PERIMETER),
          color = "black", size = 0.15) +
  # Single fill scale: colours and comma-formatted labels together
  scale_fill_gradient(low = "Darkblue", high = "blue",
                      labels = scales::comma,
                      guide = "colorbar", na.value = "DarkGrey") +
  theme(legend.position = "right") +
  labs(title = "Perimeter Map of All Lakes in state of Florida",
       subtitle = "Lakes shown in blue",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))
Visualization of a model on the housing dataset: here we model house prices with the WestRoxbury dataset and visualize how different variables affect the predicted price. The WestRoxbury dataset contains 5,802 rows and 14 columns (names as used after renaming): Total_Value, TAX, LOT_SQFT, YR BUILT, GROSS_AREA, LIVING_AREA, FLOORS, ROOMS, BEDROOMS, FULL_BATH, HALF_BATH, KITCHEN, FIREPLACE, and REMODEL.
WestRoxbury <- read_csv("C:/Susmitha Chereddy/Data_visualization/Mini_project_2_chereddy/data/WestRoxbury.csv") %>%
  # Replace the spaces in column names with underscores
  rename(Total_Value = `TOTAL VALUE`,
         LOT_SQFT = `LOT SQFT`,
         GROSS_AREA = `GROSS AREA`,
         FULL_BATH = `FULL BATH`,
         LIVING_AREA = `LIVING AREA`,
         HALF_BATH = `HALF BATH`)
## Rows: 5802 Columns: 14
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): REMODEL
## dbl (13): TOTAL VALUE, TAX, LOT SQFT, YR BUILT, GROSS AREA, LIVING AREA, FLO...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
WestRoxbury
## # A tibble: 5,802 × 14
## Total_Value TAX LOT_SQFT `YR BUILT` GROSS_AREA LIVING_AREA FLOORS ROOMS
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 344. 4330 9965 1880 2436 1352 2 6
## 2 413. 5190 6590 1945 3108 1976 2 10
## 3 330. 4152 7500 1890 2294 1371 2 8
## 4 499. 6272 13773 1957 5032 2608 1 9
## 5 332. 4170 5000 1910 2370 1438 2 7
## 6 337. 4244 5142 1950 2124 1060 1 6
## 7 359. 4521 5000 1954 3220 1916 2 7
## 8 320. 4030 10000 1950 2208 1200 1 6
## 9 334. 4195 6835 1958 2582 1092 1 5
## 10 409. 5150 5093 1900 4818 2992 2 8
## # … with 5,792 more rows, and 6 more variables: BEDROOMS <dbl>,
## # FULL_BATH <dbl>, HALF_BATH <dbl>, KITCHEN <dbl>, FIREPLACE <dbl>,
## # REMODEL <chr>
summary(WestRoxbury$FLOORS)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 2.000 1.684 2.000 3.000
Linear model Total_Value~LOT_SQFT
ggplot(WestRoxbury, aes(x = LOT_SQFT, y = Total_Value)) +
geom_point() +
geom_smooth(method = "lm",formula = "y ~ x") +
theme_minimal()+
labs(title = "Relationship between Total House Value and LOT AREA",
subtitle = "Total value in $$ and LOT Area in SQFT",
caption = "Data source: reisanar/datasets")+
theme(plot.title = element_text(hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1, face = "italic"))
Linear model Total_Value~GROSS_AREA
ggplot(WestRoxbury, aes(x = GROSS_AREA, y = Total_Value)) +
geom_point() +
geom_smooth(method = "lm",formula = "y ~ x") +
theme_minimal()+
labs(title = "Relationship between Total House Value and GROSS AREA",
subtitle = "Total value in $$ and GROSS Area in SQFT",
caption = "Data source: reisanar/datasets")+
theme(plot.title = element_text(hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1, face = "italic"))
Linear model Total_Value~LIVING_AREA
ggplot(WestRoxbury, aes(x = LIVING_AREA, y = Total_Value)) +
geom_point() +
geom_smooth(method = "lm",formula = "y ~ x") +
theme_minimal()+
labs(title = "Relationship between Total House Value and LIVING AREA",
subtitle = "Total value in $$ and LIVING Area in SQFT",
caption = "Data source: reisanar/datasets")+
theme(plot.title = element_text(hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1, face = "italic"))
Linear Model Total_Value ~ LOT_SQFT + HALF_BATH + FLOORS
library(broom)
house_model <- lm(Total_Value ~ LOT_SQFT + HALF_BATH + FLOORS, data= WestRoxbury)
house_coefs <- tidy(house_model, conf.int = TRUE) %>%
filter(term != "(Intercept)") # We can typically skip plotting the intercept, so remove it
house_coefs
## # A tibble: 3 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 LOT_SQFT 0.0184 0.000343 53.7 0 0.0178 0.0191
## 2 HALF_BATH 29.0 1.80 16.1 4.67e- 57 25.5 32.5
## 3 FLOORS 88.2 2.15 41.0 2 e-323 84.0 92.4
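In an additive linear model, each estimate is the expected change in the response for a one-unit increase in that predictor with the other predictors held fixed. The sketch below checks this on R's built-in `mtcars` data, used only as a stand-in (it is not the WestRoxbury data): the difference between two predictions that differ by one unit of `cyl` equals the `cyl` coefficient.

```r
# Stand-in model on built-in data; the variables are unrelated to WestRoxbury.
fit <- lm(mpg ~ wt + cyl, data = mtcars)

# Two hypothetical cars that are identical except for one extra cylinder
new_cars <- data.frame(wt = mean(mtcars$wt), cyl = c(4, 5))
preds <- predict(fit, newdata = new_cars)

# The difference in predictions equals the cyl coefficient
diff(preds)
coef(fit)["cyl"]
```

The same reading applies to the table above: under this model, one extra floor is associated with about 88.2 more units of Total_Value, holding LOT_SQFT and HALF_BATH fixed.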
Plotting the coefficient estimates for the house model (Total_Value ~ LOT_SQFT + HALF_BATH + FLOORS)
ggplot(house_coefs,
aes(x = estimate,
y = fct_rev(term))) +
geom_pointrange(aes(xmin = conf.low,
xmax = conf.high)) +
geom_vline(xintercept = 0,
color = "purple") +
theme_minimal()+
labs(title = "Coefficient estimates for the model (LOT_SQFT, HALF_BATH, FLOORS)",
subtitle = "Points are estimates; bars are 95% confidence intervals",
caption = "Data source: reisanar/datasets")+
theme(plot.title = element_text(hjust = 0.5, size = 14),
plot.subtitle = element_text(hjust = 0.5),
plot.caption = element_text(hjust = 1, face = "italic"))
house_new_data <- expand_grid(
LOT_SQFT = mean(WestRoxbury$LOT_SQFT),
FLOORS = c(1,2,3),
HALF_BATH = c(1,2,3))
head(house_new_data)
## # A tibble: 6 × 3
## LOT_SQFT FLOORS HALF_BATH
## <dbl> <dbl> <dbl>
## 1 6278. 1 1
## 2 6278. 1 2
## 3 6278. 1 3
## 4 6278. 2 1
## 5 6278. 2 2
## 6 6278. 2 3
predicted_house <- augment(
house_model,
newdata = house_new_data,
se_fit = TRUE
)
head(predicted_house)
## # A tibble: 6 × 5
## LOT_SQFT FLOORS HALF_BATH .fitted .se.fit
## <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 6278. 1 1 344. 2.02
## 2 6278. 1 2 373. 3.39
## 3 6278. 1 3 402. 5.04
## 4 6278. 2 1 432. 1.21
## 5 6278. 2 2 461. 2.54
## 6 6278. 2 3 490. 4.24
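The `.se.fit` column is the standard error of each fitted mean, so an approximate 95% confidence band is `.fitted ± 1.96 * .se.fit`. The sketch below shows the same computation with base R's `predict()` on the built-in `mtcars` data (a stand-in, not the WestRoxbury data) and compares it with the t-based interval returned by `predict(..., interval = "confidence")`.

```r
# Stand-in model on built-in data; the variables are unrelated to WestRoxbury.
fit <- lm(mpg ~ wt + cyl, data = mtcars)
new_cars <- data.frame(wt = mean(mtcars$wt), cyl = c(4, 6, 8))

# Fitted values and their standard errors
p <- predict(fit, newdata = new_cars, se.fit = TRUE)

# Normal-approximation band, the kind typically drawn as a ribbon
lower_z <- p$fit - 1.96 * p$se.fit
upper_z <- p$fit + 1.96 * p$se.fit

# Exact t-based band from predict()
ci_t <- predict(fit, newdata = new_cars, interval = "confidence", level = 0.95)

# The t-based band is slightly wider because qt(0.975, df) exceeds 1.96
cbind(lower_z, upper_z, ci_t)
```

With thousands of residual degrees of freedom, as in the house model, the t quantile is very close to 1.96, so the two bands are nearly identical.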
Plot of fitted values for each combination of FLOORS and HALF_BATH
ggplot(predicted_house,
       aes(x = FLOORS, y = .fitted)) +
  # Approximate 95% band: fitted value +/- 1.96 standard errors
  geom_ribbon(aes(ymin = .fitted - 1.96 * .se.fit,
                  ymax = .fitted + 1.96 * .se.fit,
                  fill = HALF_BATH),
              alpha = 0.5) +
  geom_line(aes(color = HALF_BATH), size = 1) +
  # "none" replaces the deprecated guides(<scale> = FALSE)
  guides(fill = "none", color = "none") +
  facet_wrap(vars(HALF_BATH)) +
  theme_minimal() +
  labs(title = "Fitted values by number of floors and half baths",
       subtitle = "Predicted Total_Value with LOT_SQFT held at its mean; one panel per HALF_BATH value",
       caption = "Data source: reisanar/datasets") +
  theme(plot.title = element_text(hjust = 0.5, size = 14),
        plot.subtitle = element_text(hjust = 0.5),
        plot.caption = element_text(hjust = 1, face = "italic"))
What were the original charts you planned to create for this assignment? What steps were necessary for cleaning and preparing the data?
I used three datasets for this mini project. For the first, interactive visualization, I wanted to identify differences in shot power, passing, tackling, and potential across players. Since the data has no age-group variable, I used age to bin the players into four groups and then ran my analysis on those groups.
For the second dataset, since it was a spatial visualization, I used the data as provided. For the third dataset, I renamed the variables to remove the spaces in their names.
What story could you tell with your plots? What difficulties did you encounter while creating the visualizations? What additional approaches do you think can be used to explore the data you selected?
From the First interaction plot,
From the Spatial Visualization plot:
From the Visualization of Model, we could see that
How did you apply the principles of data visualizations and design for this assignment?